Simple ANOVA Examples¶
Introduction¶
This is an example of working through an ANOVA, with a really simple dataset, using statsmodels. In some cases, we perform explicit computation of the model parameters, and then compare them to the statsmodels answers. The examples are taken from "Facts from Figures" by M. J. Moroney, a Pelican book from before the days of computers.
Initialize Environment¶
%load_ext watermark
%reload_ext lab_black
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sn
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from scipy import stats
from statsmodels.formula.api import ols
from statsmodels.formula.api import rlm
import statsmodels.api as sm
from statsmodels.stats.anova import AnovaRM
from patsy import dmatrices
from patsy.contrasts import Treatment
from patsy.contrasts import Sum
from patsy.contrasts import Diff
import warnings
Facts from Figures Chap 19¶
Dataset 1¶
This dataset is taken from "Facts from Figures", by M. J. Moroney, page 374. We have four samples (each of five items) taken from some population. We will look at the between-sample variation, as opposed to the within-sample variation.
We first look at the dataset in "wide" format (with one column for each sample), and create a pandas dataframe
S1 = [2, 3, 1, 3, 1]
S2 = [3, 4, 3, 5, 0]
S3 = [6, 8, 7, 4, 10]
S4 = [5, 5, 5, 3, 2]
df_dict = {"S1": S1, "S2": S2, "S3": S3, "S4": S4}
data = pd.DataFrame(df_dict)
data
data.columns
Show the mean of each sample
for name in data.columns:
print(name, data[name].mean())
Another way to show the means, using pandas methods operating on the dataframe as a whole
data.mean()
Get the mean of the whole dataset (this works because each sample has the same number of observations, namely 5)
mu = data.mean().mean()
mu
Create a new "long" dataset, where sample number is now an attribute (not a column identity). Further, we tell pandas that the sample number is a categorical variable, not numeric
values = (
list(data["S1"])
+ list(data["S2"])
+ list(data["S3"])
+ list(data["S4"])
)
sample = (
list(np.ones(5) * 1)
+ list(np.ones(5) * 2)
+ list(np.ones(5) * 3)
+ list(np.ones(5) * 4)
)
# get a dict holding names of columns and associated data
dict2 = {"values": values, "sample": sample}
# build dataframe
data_long = pd.DataFrame(dict2)
# declare sample to be a categorical variable
data_long["sample"] = data_long["sample"].astype(int)
data_long["sample"] = data_long["sample"].astype("category")
# show the first and last few rows
print(data_long.head(), "\n", data_long.tail())
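As an aside, the same "long" dataframe can be built more idiomatically with the pandas melt method. This is a sketch; note that here the sample labels stay as the original column names ("S1" to "S4") rather than the integer codes used above.

```python
import pandas as pd

# the same wide dataset as above
data = pd.DataFrame(
    {
        "S1": [2, 3, 1, 3, 1],
        "S2": [3, 4, 3, 5, 0],
        "S3": [6, 8, 7, 4, 10],
        "S4": [5, 5, 5, 3, 2],
    }
)

# melt wide -> long: one row per observation, with the old column
# name carried along as the 'sample' attribute
data_long2 = data.melt(var_name="sample", value_name="values")
data_long2["sample"] = data_long2["sample"].astype("category")
print(data_long2.shape)  # (20, 2)
```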
Compute Sums of Squares¶
We compute a number of sums of squares:
- sst is the sum of squares of variation across the whole dataset, against the Global Mean
- ssm is the sum of squares of the difference between the Sample Mean and the Global Mean, for all observations (Between Sample Variation)
- sse is the sum of squares of the difference between an observation Value and the Sample Mean, for all observations (Within Sample Variation)
#
# Get total sum of squares: Values against Global Mean
#
mean_v = data_long["values"].mean()
sst = sum(
    (data_long["values"] - mean_v)
    * (data_long["values"] - mean_v)
)
sst
#
# Get the between Sample sum of squares
# get the square of difference between the Sample Mean, and the Global Mean, for all observations
#
ssm = sum(
    [
        (mean_v - data.mean().iloc[i - 1]) ** 2
        for i in data_long['sample'].astype(int)
    ]
)
ssm
#
# Get the within sample sum of squares of an observation Value and the Sample Mean, for all observations
#
sse = sum(
    [
        (v - data.mean().iloc[s - 1]) ** 2
        for v, s in zip(
            data_long['values'],
            data_long['sample'].astype(int),
        )
    ]
)
sse
- For the sample as a whole, the Degrees of Freedom = 19 (20 observations - 1)
- For the between-sample variation, the Degrees of Freedom = 3 (4 samples - 1)
- For the within-sample variation, the Degrees of Freedom = 16 (4 samples, each of 5 observations = 4 * (5 - 1))
The variance estimates for each of sst
, ssm
and sse
are:
print(
f'Variance Estimates (total, between sample, within sample): {(sst / 19):10.5f}, {(ssm / 3):10.5f}, {(sse / 16):10.5f}'
)
The ratio of the between sample variance to the within sample variance (i.e. how much bigger is the spread of sample means vs spread of values in a single sample) is (under the Null Hypothesis) an F statistic
print(f' F value = {(ssm / 3)/((sse / 16)):10.5f}')
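The tail probability of this F statistic can be obtained from scipy.stats. This sketch recomputes the three sums of squares from the raw data, checks the partition identity sst = ssm + sse, and converts the F value into a p value (using the 3 and 16 degrees of freedom derived above).

```python
import numpy as np
from scipy import stats

# the four samples from the wide dataset above
samples = [
    [2, 3, 1, 3, 1],
    [3, 4, 3, 5, 0],
    [6, 8, 7, 4, 10],
    [5, 5, 5, 3, 2],
]
all_values = np.concatenate([np.asarray(s, dtype=float) for s in samples])
mu = all_values.mean()

sst = ((all_values - mu) ** 2).sum()
ssm = sum(len(s) * (np.mean(s) - mu) ** 2 for s in samples)
sse = sum(((np.asarray(s, dtype=float) - np.mean(s)) ** 2).sum() for s in samples)

# the total sum of squares partitions exactly into between + within
assert np.isclose(sst, ssm + sse)

f_value = (ssm / 3) / (sse / 16)
p_value = stats.f.sf(f_value, 3, 16)  # P(F >= f_value) under H0
print(f_value, p_value)
```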
Visualize Dataset¶
We use pandas plotting methods to show an overlapping set of histograms, one for each sample and one for the total dataset. It turns out that this is not a startlingly good way of representing the data
fig, ax = plt.subplots(figsize=(8, 8))
# show histogram for whole dataset
data_long.hist('values', ax=ax, alpha=0.2, label='All Data')
# show histogram for each sample
for i in [1, 2, 3, 4]:
data_long[data_long['sample'] == i].hist(
'values', ax=ax, alpha=0.5, label='Sample ' + str(i)
)
# end for
ax.legend(loc='best')
Let's try again with the old standard, a scatter plot. We show the global mean, each sample mean, and each datapoint
_ = plt.plot(
data_long['sample'].astype(int),
data_long['values'],
'o',
alpha=0.5, # allows overlapping points to be distinguished
label='Data',
)
_ = plt.axhline(
data_long['values'].mean(),
color='k',
label='Global Mean',
)
for i in [1, 2, 3, 4]:
plt.plot(
i,
data_long[data_long['sample'] == i][
'values'
].mean(),
'*',
label='Sample Mean ' + str(i),
alpha=1,
markersize=10,
)
# end for
_ = plt.legend(loc='best')
Fit Linear Model¶
Based upon the graphs above, we first normalize the dataset, by subtracting the Global Mean from each value.
We then specify a model
- value = beta_1 * sample_1 + beta_2 * sample_2 + beta_3 * sample_3 + beta_4 * sample_4 + Normal(0, std_err)
where sample_i = 1 if the observation is in the i-th sample, and 0 otherwise (known as one-hot encoding)
When we call ols
, we have to specify that we do not want an intercept term by adding "-1" to the formula we use
Under the Null Hypothesis (H0), the observations in all samples are drawn from the same distribution, so beta_i will be 0 for each i in 1, 2, 3, 4
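The one-hot encoding described above can be inspected directly. A minimal sketch using pandas get_dummies (patsy's dmatrices, already imported above, should build an equivalent indicator matrix for the "-1" formula):

```python
import pandas as pd

# sample labels as above: four samples of five observations each
sample = pd.Series([1] * 5 + [2] * 5 + [3] * 5 + [4] * 5, dtype="category")

# one indicator column per sample; exactly one 1 in each row
one_hot = pd.get_dummies(sample, prefix="sample", dtype=int)
print(one_hot.head())
```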
data_long_z = data_long.copy()
data_long_z['values'] = data_long_z['values'] - mean_v
res2_2 = ols("values ~ sample -1", data=data_long_z).fit()
res2_2.summary()
We can see from the P>|t|
column above that samples 2 and 4 do not have values significantly different (at the 5% level) from the overall population, but samples 1 and 3 do appear to differ
We can confirm various values held in the RegressionResults object
Compare to
Variance Estimates (total, between sample, within sample): 6.10526, 23.33333, 2.87500
print(
f' Mean Square Total {res2_2.mse_total:10.5f}'
)
print(
f' Mean Square Between Samples {res2_2.mse_model:10.5f}'
)
print(
f' Mean Square Within Samples {res2_2.mse_resid:10.5f}'
)
The standard error for the mean of a sample of 5 items will be a factor of 1/sqrt(5) smaller than the standard error for a single item (the value displayed by RegressionResults.summary()
agrees)
print(
f' Mean of Sample of 5 Standard Error = {np.sqrt(res2_2.mse_resid / 5):10.5f}'
)
Get the probability of getting sample means as far from zero as we actually have. Coefficients 1 and 3 are very unlikely to have been drawn from a population having the same mean as the rest of the samples. This agrees with the results in the RegressionResults summary above
(
stats.t.sf(2.638, 16) * 2,
stats.t.sf(1.319, 16) * 2,
stats.t.sf(3.956, 16) * 2,
stats.t.sf(0, 16) * 2,
)
Get 5% confidence limits
lower_ci = 0.758 * stats.t.ppf(0.025, 16)
upper_ci = 0.758 * stats.t.ppf(0.975, 16)
lower_ci, upper_ci
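The 0.758 factor above is the standard error of a sample mean, sqrt(mse_resid / 5). As a sketch, the confidence half-width can be recomputed from scratch using the within-sample variance of 2.875 found earlier:

```python
import numpy as np
from scipy import stats

mse_resid = 2.875                   # within-sample variance from above
se_mean = np.sqrt(mse_resid / 5)    # standard error of a mean of 5 observations
t_crit = stats.t.ppf(0.975, 16)     # two-sided 5% critical value, 16 df
half_width = se_mean * t_crit       # half-width of the 95% confidence interval
print(se_mean, half_width)
```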
for i in [1, 2, 3, 4]:
plt.plot(
i,
res2_2.params.iloc[i - 1],
'bo',
label='Sample Coeff ' + str(i),
alpha=1,
markersize=10,
)
plt.plot(
i,
res2_2.params.iloc[i - 1] + lower_ci,
'go',
alpha=1,
markersize=10,
)
plt.plot(
i,
res2_2.params.iloc[i - 1] + upper_ci,
'go',
alpha=1,
markersize=10,
)
plt.plot(
[i, i],
[
res2_2.params.iloc[i - 1] + lower_ci,
res2_2.params.iloc[i - 1] + upper_ci,
],
'r-',
alpha=1,
markersize=10,
)
# end for
_ = plt.axhline(0, color='k')
plt.xlabel('Sample Number')
plt.ylabel(
'Coefficient of Normalized Sample\n(= Normalized Sample Mean) '
)
We get the same results when we get predictions from our OLS linear model
for i in [1, 2, 3, 4]:
gp = res2_2.get_prediction({'sample': i})
print(gp.summary_frame())
# end for
Non-Normalized Model¶
We could choose not to normalize our model to zero mean (at the small inconvenience of making the coefficients a little harder to understand; they are now offsets from the mean of Sample 1), and we would get exactly the same results
res2_3 = ols("values ~ sample ", data=data_long).fit()
res2_3.summary()
# get predictions of the mean for all our samples
for i in [1, 2, 3, 4]:
gp = res2_3.get_prediction({'sample': i})
print(gp.summary_frame())
# end for
The Linear Model ANOVA method gives the same F value as we explicitly calculated above.
sm.stats.anova_lm(res2_3)
Dataset 2¶
This is taken from "Facts from Figures", page 380. We have data on price bid on the same goods, packaged in 4 different ways. We wish to determine if package affects price significantly.
Set up our pandas
dataframe, declaring pack
to be a categorical variable
value = (
[66, 82, 60, 50, 60, 90]
+ [42, 66, 30, 60, 36, 48]
+ [54, 90, 60, 81, 60, 51]
+ [78, 54, 60, 42, 71, 49]
)
pack = (
list(np.ones(6) * 1)
+ list(np.ones(6) * 2)
+ list(np.ones(6) * 3)
+ list(np.ones(6) * 4)
)
df_dict = {"value": value, "pack": pack}
data = pd.DataFrame(df_dict)
data["pack"] = data["pack"].astype("category")
data.head()
Visualize our dataset, plotting value against package number, for all our observations. We also mark the within-package means (stars) and the global mean (black line).
_ = plt.plot(data['pack'].astype(int), data['value'], 'o')
for p in data['pack'].unique():
plt.plot(
p,
data[data['pack'] == p]['value'].mean(),
'*',
markersize=20,
)
# end for
_ = plt.axhline(data['value'].mean(), color='k')
_ = plt.xlabel('Pack Number')
_ = plt.ylabel('Assessed Value')
Fit a Linear Model¶
res3 = ols("value ~ pack", data=data).fit()
res3.summary()
for i in [1, 2, 3, 4]:
gp = res3.get_prediction({'pack': i})
print(gp.summary_frame())
# end for
Perform ANOVA¶
We see that an F value this large would be seen about 9 times out of 100, so at the 5% level we cannot reject the Null Hypothesis that packaging has no effect on the valuation of the goods.
anv = sm.stats.anova_lm(res3)
anv
Explicit F Calculation¶
#
# Get total sum of squares Values against Global Mean
#
mean_v = data["value"].mean()
sst = sum(
(data["value"] - mean_v) * (data["value"] - mean_v)
)
sst
#
# Get the between Package sum of squares
# get the square of difference between the Package Mean, and the Global Mean, for all observations
#
ssm = sum(
[
(mean_v - data[data['pack'] == i]['value'].mean())
** 2
for i in data['pack'].astype(int)
]
)
ssm
#
# Get the within Package sum of squares of an observation Value and the Package Mean, for all observations
#
sse = sum(
[
(v - data[data['pack'] == p]['value'].mean()) ** 2
for v, p in zip(
data['value'], data['pack'].astype(int)
)
]
)
sse
- For the sample as a whole, the Degrees of Freedom = 23 (24 observations - 1)
- For the between-sample variation, the Degrees of Freedom = 3 (4 packages - 1)
- For the within-sample variation, the Degrees of Freedom = 20 (4 packages, each of 6 observations = 4 * (6 - 1))
print(
f'Variance Estimates (total, between sample, within sample): {(sst / 23):10.5f}, {(ssm / 3):10.5f}, {(sse / 20):10.5f}'
)
print(f' F value = {(ssm / 3)/((sse / 20)):10.5f}')
The explicitly computed F value agrees with the statsmodels
results
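To convert this F value into the "about 9 times out of 100" probability quoted in the ANOVA section, we can apply the F distribution survival function. A sketch, recomputing the sums of squares from the raw data and using the 3 and 20 degrees of freedom derived above:

```python
import numpy as np
from scipy import stats

# assessed values for the four package types, as above
packs = [
    [66, 82, 60, 50, 60, 90],
    [42, 66, 30, 60, 36, 48],
    [54, 90, 60, 81, 60, 51],
    [78, 54, 60, 42, 71, 49],
]
all_values = np.concatenate([np.asarray(p, dtype=float) for p in packs])
mu = all_values.mean()

# between-package and within-package sums of squares
ssm = sum(len(p) * (np.mean(p) - mu) ** 2 for p in packs)
sse = sum(((np.asarray(p, dtype=float) - np.mean(p)) ** 2).sum() for p in packs)

f_value = (ssm / 3) / (sse / 20)
p_value = stats.f.sf(f_value, 3, 20)  # P(F >= f_value) under H0
print(f_value, p_value)  # p is a little under 0.1
```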
Two Levels of ANOVA¶
See "Facts from Figures", page 385. In this example, we have sales results for four different persons (A, B, C, D), each selling at three different locations (C = CBD, O = Outskirts, S = Shopping Centre). The sales results are normalized before the analysis, but this doesn't affect the analysis of significance.
Create our pandas
dataframe
sales = [30, 70, 30, 30, 80, 50, 40, 70, 100, 60, 80, 80]
C = 'C'
O = 'O'
S = 'S'
district = [C, C, C, C, O, O, O, O, S, S, S, S]
A = 'A'
B = 'B'
C = 'C'
D = 'D'
person = [A, B, C, D, A, B, C, D, A, B, C, D]
df_dict = {
'sales': sales,
'district': district,
'person': person,
}
data4 = pd.DataFrame(df_dict)
Visualize the effect different salespersons and locations have on the observations
_ = sm.graphics.interaction_plot(
data4['person'], data4['district'], data4['sales']
)
_ = sm.graphics.interaction_plot(
data4['district'], data4['person'], data4['sales']
)
Normalize the observations (originally to make manual calculation easier - I told you this was old-school)
data4['sales'] = (data4['sales'] - 50) / 10
data4
Compute the global mean
mean_v = sum(data4['sales']) / len(data4['sales'])
mean_v
Get the total sum of squares, and its degrees of freedom
total_ss = sum(
[
(mean_v - data4['sales'].iloc[i]) ** 2
for i in range(len(data4))
]
)
total_df = len(data4) - 1
total_ss, total_df
Get the between Person sum of squares, and its degrees of freedom
person_ss = 0
for i in range(len(data4)):
name = data4['person'].iloc[i]
person_ss = (
person_ss
+ (
mean_v
- data4[data4['person'] == name]['sales'].mean()
)
** 2
)
# end for
person_df = len(data4['person'].unique()) - 1
person_ss, person_df
Get the between Districts sum of squares, and its degrees of freedom
district_ss = 0
for i in range(len(data4)):
name = data4['district'].iloc[i]
district_ss = (
district_ss
+ (
mean_v
- data4[data4['district'] == name][
'sales'
].mean()
)
** 2
)
# end for
district_df = len(data4['district'].unique()) - 1
district_ss, district_df
Get the residual sum of squares, and its degrees of freedom
residual_ss = total_ss - person_ss - district_ss
residual_df = total_df - person_df - district_df
residual_ss, residual_df
Get the estimated variance for between Person, between Districts, and Residual
person_var = person_ss / person_df
district_var = district_ss / district_df
residual_var = residual_ss / residual_df
person_var, district_var, residual_var
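The F values for the two factors, and their tail probabilities, follow from these variance estimates. A sketch using the normalized sales data, with the degrees of freedom (3 for persons, 2 for districts, 6 residual) derived above:

```python
import numpy as np
from scipy import stats

# normalized sales: rows = districts (C, O, S), cols = persons (A, B, C, D)
sales = np.array(
    [
        [-2, 2, -2, -2],
        [3, 0, -1, 2],
        [5, 1, 3, 3],
    ],
    dtype=float,
)
mu = sales.mean()

# between-factor and residual sums of squares
district_ss = (4 * (sales.mean(axis=1) - mu) ** 2).sum()
person_ss = (3 * (sales.mean(axis=0) - mu) ** 2).sum()
total_ss = ((sales - mu) ** 2).sum()
residual_ss = total_ss - district_ss - person_ss

# F ratios against the residual variance
f_person = (person_ss / 3) / (residual_ss / 6)
f_district = (district_ss / 2) / (residual_ss / 6)
p_person = stats.f.sf(f_person, 3, 6)
p_district = stats.f.sf(f_district, 2, 6)
print(f_person, p_person)
print(f_district, p_district)
```

Neither p value falls below 0.05, consistent with the conclusion drawn from the statsmodels ANOVA table below.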
Declare to pandas
(and hence to statsmodels
) that person
and district
are categorical variables. Then fit a linear model by OLS
data4['person'] = data4['person'].astype('category')
data4['district'] = data4['district'].astype('category')
res_sales = ols(
'sales ~ district + person', data=data4
).fit()
res_sales.summary()
sm.stats.anova_lm(res_sales)
The conclusion is that the observed F values are not strong enough evidence to reject the Null Hypothesis that there is no effect of salesperson or district on sales results.
Environment¶
%watermark -h -iv
%watermark